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We introduce a machine learning model to predict atomization energies of a diverse set of organic 
molecules, based on nuclear charges and atomic positions only. The problem of solving the molecular 
Schrodinger equation is mapped onto a non-linear statistical regression problem of reduced complex- 
ity. Regression models are trained on and compared to atomization energies computed with hybrid 
density-functional theory. Cross-validation over more than seven thousand small organic molecules 
yields a mean absolute error of ~10 kcal/mol. Applicability is demonstrated for the prediction of 
molecular atomization potential energy curves. 



Solving the Schrodinger equation (SE), = , for 
assemblies of atoms is a fundamental problem in quantum 
mechanics. Alas, solutions that are exact up to numerical 
precision are intractable for all but the smallest systems 
with very few atoms. Hierarchies of approximations have 
evolved, usually trading accuracy for computational effi- 
ciency PQ . Conventionally, the external potential, defined 
by a set of nuclear charges {Zj} and atomic positions 
{R/}, uniquely determines the Hamiltonian H of any 
system, and thereby the potential energy by optimizing 
Vl/,[2] -ff({Zj,R/}) i— 1 -^ E. For a diverse set of organic 
molecules, we show that one can use machine learning 
(ML) instead, {Z/,R/} i— » E. Thus, we circumvent 
the task of explicitly solving the SE by training once 
a machine on a finite subset of known solutions. Since 
many interesting questions in physics require to repeat- 
edly solve the SE, the highly competitive performance 
of our ML approach may pave the way to large scale 
exploration of molecular energies in chemical compound 
space [Sill]. 

ML techniques have recently been used with success to 
map the problem of solving complex physical differential 
equations to statistical models. Successful attempts in- 
clude solving Fokker-Planck stochastic differential equa- 
tions [5] , parameterizing interatomic force fields for fixed 
chemical composition j6j [7j , and the discovery of novel 
ternary oxides for batteries |S]. Motivated by these, and 
other related efforts [51-112] . we develop a non-linear re- 
gression ML model for computing molecular atomization 
energies in chemical compound space [3]. Our model 
is based on a measure of distance in compound space 
that accounts for both stoichiometry and configurational 
variation. After training, energies are predicted for new 
(out-of-sample) molecular systems, differing in composi- 
tion and geometry, at negligible computational cost, i.e. 
milli seconds instead of hours on a conventional CPU. 
While the model is trained and tested using atomization 
energies calculated at the hybrid density-functional the- 



ory (DFT) level [H [T31 03] , any other training set or level 
of theory could be used as a starting point for subsequent 
ML training. Cross-validation on 7165 molecules yields 
a mean absolute error of 9.9 kcal/mol, which is an or- 
der of magnitude more accurate than counting bonds or 
semi-empirical quantum chemistry. 

We use the GDB data base, a library of nearly one 
billion organic molecules that are stable and syntheti- 
cally accessible according to organic chemistry rules [TH] . 
While potentially applicable to any stoichiometry, as a 
proof of principle we restrict ourselves to small organic 
molecules. Specifically, we define a controlled test-bed 
consisting of all 7165 organic molecules from the GDB 
data base with up to seven "heavy" atoms that contain 
C, N, O, or S, being saturated with hydrogen atoms. At- 
omization energies range from -800 to -2000 kcal/mol. 
Structural features include a rich variety of chemistry 
such as double, and triple-bonds; (hetero)cycles, carboxy, 
cyanide, amide, alcohol, and epoxy-groups. For each of 
the many stoichiometrics, many constitutional (differing 
chemical bonds) but no conformational isomers are part 
of this data base. Based on the string representation 
of molecules in the data base, we generated Cartesian 
geometries with OpenBabel 16J . Thereafter, the PBE0 
approximation to hybrid DFT [171 118] in converged nu- 
merical basis, as implemented in the FHI-aims code |19] 
(tight settings/tier2 basis set), was used to compute ref- 
erence atomization energies for training. Our choice of 
the PBE0 hybrid functional is motivated by small errors 
(< 5 kcal/mol) for thermo-chemistry data that includes 
molecular atomization energies [20] . 

One of the most important ingredients for ML is the 
choice of an appropriate data representation that reflects 
prior knowledge of the application domain, i.e. a model 
of the underlying physics. A variety of such "descrip- 
tors" are used by statistical methods for chem- and bio- 
informatics applications [21H23] . For modeling atomiza- 
tion energies, we use the same molecular information that 
enters the Hamiltonian for an electronic structure calcu- 



lation, namely the set of Cartesian coordinates, {R/}, 
and nuclear charges, \Zj\. Our representation consists 
of atomic energies, and the inter-nuclear Coulomb repul- 
sion operator, Specifically, we represent any molecule by 
a "Coulomb" matrix M, 



M u = 



ZjZj 



V / = J, 



(1) 



L |Rj-R.j| 

Here, off-diagonal elements correspond to the Coulomb 
repulsion between atoms / and J, while diagonal ele- 
ments encode a polynomial fit of atomic energies to nu- 
clear charge. 

Using ML we attempt to construct a non-linear map 
between molecular characteristics and atomization ener- 
gies. This requires a measure of molecular (dis)similarity 
that is invariant with respect to translations, rotations, 
and the index ordering of atoms. To this end, we mea- 
sure the distance between two molecules by the Eu- 
clidean norm of their diagonalized Coulomb matrices: 
d(M,M') = d(e,e') = 



e'rl 2 , where e are the 



eigenvalues of M in order of decreasing absolute value. 
For matrices that differ in dimensionality, c of the smaller 
system is extended by zeros. Note that by represent- 
ing chemical compound space in this way, (i) any sys- 
tem is uniquely encoded because stoichiometry as well 
as atomic configuration are explicitly accounted for, (ii) 
symmetrically equivalent atoms contribute equally, (iii) 
the diagonalized M is invariant with respect to atomic 
permutations, translations, and rotations, and (iv) the 
distance is continuous with respect to small variations in 
inter-atomic distances or nuclear charges. As discussed 
in Ref. [24], these are all crucial criteria for representing 
atomistic systems within statistical models. 

In Fig. [TJ relative atomization energies, as a function 
of g?(M, M'), and a histogram of distances are shown for 
all pairs of molecules in our data set. The inset exempli- 
fies the distances between three molecular species, pyrrol, 
thiophene, and ethanol: Within our measure of similar- 
ity the nitrogen containing aromatic heterocycle pyrrol 
is ~10 times farther away from its sulfur containing ana- 
logue, thiophene, than from ethanol. This is due to the 
large difference in nuclear charges between atoms from 
different rows in the periodic table. 

Within our ML model [25 27J, the energy of a molecule 
M is a sum over weighted Gaussians, 



N 



E est (M) = > c^exp 



l 



I 

~2a* 



rf(M,M,) 2 



(2) 



where i runs over all molecules Mj in the training set. 
Regression coefficients {a{\ and length-scale parameter 
a are obtained from training on {Mi, i?[ e ^}. Note that 
each training molecule i contributes to the energy not 
only according to its distance, but also according to its 
specific weight on. The {-E"™^} were computed at PBEO 
DFT level of theory. 
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FIG. 1: (Color online) Top: Distribution of distances, 
d(M, M'), for all molecular pairs occurring in the first 
7165 small organic molecules from the GDB data base [15] . 
The inset exemplifies two distances, pyrrol/ethanol and 
pyrrol/thiophene (N: Blue, O: Red, S: Yellow, C: Black, H: 
White). Bottom: Absolute differences in atomization energies 
between M and M' as a function of d(M, M'). 



To determine {cti}, we used kernel ridge regression |26) . 
This regularized model limits the norm of regression coef- 
ficients, {a{\, thereby ensuring the transferability of the 
model to new compounds. For given length-scale a and 
regularization parameter A, the explicit solution to the 
minimization problem, 



'E est (Mi)-E™ f ) 2 + \J2< 



(3) 



is given by a = (K + AI)- 1 E re - f , K i3 = 
exp[— d(Mi, M,,) 2 /(2<7 2 )] being the kernel matrix of all 
training molecules, and I denoting the identity matrix. 

We used stratified [33] five- fold cross-validation [2"61 |2"T] 
for model selection and to estimate performance. Param- 
eters A and a were determined in an inner loop of five- 
fold cross-validation using a logarithmically scaling grid. 
This procedure is routinely applied in machine learning 
and statistics to avoid over-fitting and overly optimistic 
error estimates. 
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The dependence of the cross-validated ML perfor- 
mance on the number of molecules in training set, N, is 
illustrated in Fig. [2] (top) . When increasing N from 500 
to 7000, the mean absolute error (MAE) falls off from 
more than 17 kcal/mol to less than 10 kcal/mol. Fur- 
thermore, the width a of the Gaussian kernel decreases 
from 460 to 25 on the distance scale of Fig. [I] A small 
a emphasizes compound pairs for which the distance is 
small, whereas a larger a allows for contributions from 
distant pairs. This is to be expected for increased number 
of training molecules. Because of the discrete nature of 
chemical space (nuclear charges can only assume integer 
values), however, we do not expect continuous coverage 
for N —¥ oo, implying that a will converge to a small 
but finite value. The regularization hyperparameter A 
remains small throughout, consistent with the fact that 
we model noise-free numerical solutions of the approx- 
imated Schrddinger equation. An asymptotic fit of the 
form ~ 1/ y/~N, based on statistical theory [26] [29] sug- 
gests that the MAE can be lowered to ~ 7.6 kcal/mol 
for N — > oo. It is remarkable that already for the 
here presented, relatively small training set sizes, ML 
achieves errors of roughly one percent on the relevant 
scale of energies, clearly outperforming bond counting or 
semi-empirical quantum chemistry methods. The cross- 
validated performance for a training set size of N = 1000 
is displayed in Fig. [2] (bottom). There is good correla- 
tion with the DFT data. For comparison, corresponding 
correlations are shown for bond counting 30], and semi- 
empirical quantum chemistry (PM6 [31]) computed with 
M0PAC [32 . While the latter two methods exhibit a sys- 
tematic shift in slope, the inset highlights that the ML 
correlation accurately reproduces clustering, and slope of 
one. 

In order to assess transferability and applicability of 
our model to chemical compound space, we use a ML 
model trained on N — 1000 molecules (model lk). The 
training set of model lk contains all small molecules with 
3 to 5 heavy atoms, and a randomized stratified selection 
of larger compounds covering the entire energy range. 
The thousand Coulomb matrices corresponding to the 
OpenBabel configurations were included as well as four 
additional Coulomb matrices per molecule. These ad- 
ditional matrices were scaled in order to represent the 
repulsive wall, the dissociative limit, and the energy 
minimum at / = 1 |33[ 134] . All predictions are made 
for molecules that were not used during training of the 
model. 

For testing the transferability, we applied the lk model 
to the remaining 6k molecules. The calculations yield er- 
rors that hardly change from the estimated performance 
in the training with a MAE of 15.2 kcal/mol. For the 
selected molecular subset of the seven thousand smallest 
molecules in the GDB database [15] , we therefore con- 
clude that training on 15% of the molecules permits 
predictions of atomization energies for the remaining 85% 
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FIG. 2: (Color online) Top: Cross-validated ML errors as 
a function of number of molecules in training set, N. Bot- 
tom: For N = 1000, correlation of DFT-PBE0 [TTI [18] results 
(E ref ) with ML (cross validated) based estimates (E est ) of 
atomization energies. Correlations for bond counting |30| 
and semi-empirical quantum chemistry (PM6[3l]) are also 
shown. Corresponding RMSE (root mean square error) /MAE 
(mean absolute error) for bond counting, PM6, and ML are 
75.0/71.0, 75.1/73.1, 30.1/14.9 kcal/mol, respectively. 



with an accuracy of roughly 15 kcal/mol. 

For probing the applicability, we investigated whether 
the lk model can also be useful beyond the equilibrium 
geometries. Specifically, we calculated the functional de- 
pendence of atomization energies on scaling Cartesian ge- 
ometries by a factor, /. From the 6k molecules (not used 
for training) we picked four which exhibit chemical diver- 
sity. Specifically, these molecules contain single bonds 
and branching only (C7H16), a double bond (C 6 H 12 ), 
triple bonds including nitrogen (C6NH5), and a sulfur 
containing cycle with a hydroxy group (C4SH3OH). The 
resulting ML atomization energy curves (Fig. [3]) correctly 
distinguish between the molecules, closely reproduce the 
DFT energy at / = 1, and appear continuous and dif- 
ferentiable throughout relevant bonding distances. For 
comparison, corresponding Morse potential curves are 
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also displayed. Their well-depth and exponential factor 
were explicitly fitted to the molecular DFT minimum, 
as well as repulsive wall and dissociative limit [34]. Al- 
beit slightly overestimating equilibrium distance and well 
depth for C4SH3OH and C 6 NH 5 , the ML model is in 
overall good agreement with the Morse potential curves. 
One can speculate if the better performance of the ML 
model for the larger molecules is due to the fact that 
in the total set larger molecules are more frequent than 
smaller molecules. Again, we stress the contrast that 
while the Morse potential curves were explicitly fitted 
for these four molecules, the ML model was obtained for 
a training set based on one thousand other molecules. 
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FIG. 3: (Color online) Energy of atomization curves of 
four molecules containing single bonds and branching only 
(C7H16), a double bond (C6H12), triple bonds including ni- 
trogen (C6NH5), and a sulfur containing cycle with a hydroxy 
group (C4SH3OH). (bottom to top in insets; black: Carbon; 
blue: Nitrogen; yellow: Sulfur; red: Oxygen; white: Hydro- 
gen) (DFT-PBEO, Morse potential, and ML model lk). 

We have developed a ML approach for modeling 
atomization energies in the chemical compound space 
of small organic molecules. For larger training sets, 
accuracies have been achieved that are competitive with 
mean-field electronic structure theory, at a fraction of 
the computational cost. We find good performance 
when making predictions for new molecules (trans- 
ferability) and when predicting atomization energies 
beyond the equilibrium geometry. Our representation 
of molecules as Coulomb matrices is inspired by the 
nuclear repulsion term in the molecular Hamiltonian, 
and free atom energies. Future extensions of our 
approach might permit rational compound design 
applications [35 - 37] as well as geometrical relaxations, 
chemical reactions, or molecular dynamics in various 
ensembles [55] . Finally, our results suggest that the 
Coulomb matrix, or improvements thereof, could be of 
interest as a descriptor beyond the presented application. 
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